SPDE-RF (spatial only)

Published

October 21, 2024

Introduction

This example presents an analysis that combines Bayesian inference (INLA) with Random Forest (RF), and shows the result of the first loop of the combined algorithm. To this end, two scenarios with spatial dependence and different non-linear structures in the covariates are simulated.

The two scenarios differ in the strength of the non-linear effects:

  • strong non-linearity scenario,
  • weak non-linearity scenario.

In both scenarios, there will be a categorical variable with three levels (A, B, C), a spatial effect, and two covariates. In the strong non-linearity scenario, both covariates will exhibit a non-linear structure, whereas in the weak non-linearity scenario, only one of the covariates will have a non-linear relationship.

Data simulation

The model for data simulation is defined as follows:

\[\mathbf{y} = \mathbf{X}\boldsymbol\beta + \sum_{k\in K} \mathbf{f}_k(\mathbf{z}_k) + \mathbf{u}_s + \boldsymbol\varepsilon\]

where \(\mathbf{X}\boldsymbol\beta\) collects the covariates with linear effects: the categorical variable and, in the weak non-linearity scenario, the covariate with a linear effect. The non-linear effects of the covariates are captured by \(\sum_{k\in K} \mathbf{f}_k(\mathbf{z}_k)\), where the index set \(K\) has one element (\(|K|=1\)) in the weak non-linearity scenario and two elements (\(|K|=2\)) in the strong non-linearity scenario. Finally, \(\mathbf{u}_s\) is the spatial effect and \(\boldsymbol\varepsilon\) is the Gaussian noise of the observations. Using this structure, \(1000\) observations will be simulated at random locations within the study region.
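To make the structure of the simulation model concrete, the following is a minimal sketch of how the response could be assembled term by term. It is not the code used to produce the results in this example: the function name `simulate`, the effect sizes, the shapes of the non-linear functions, and the noise level are all hypothetical, and the spatial effect is replaced by a smooth deterministic surface rather than an SPDE simulation.

```python
import math
import random

def simulate(n=1000, scenario="strong", seed=1):
    """Simulate y = X*beta + sum_k f_k(z_k) + u_s + eps at n random locations."""
    rng = random.Random(seed)
    # Hypothetical values: the source does not report the ones it used.
    level_effect = {"A": 0.0, "B": 1.0, "C": -1.0}
    beta_x3, sigma_eps = 0.5, 0.3
    rows = []
    for _ in range(n):
        sx, sy = rng.random(), rng.random()   # location in the unit square
        level = rng.choice("ABC")             # categorical variable
        x2, x3 = rng.random(), rng.random()   # continuous covariates
        f2 = math.sin(2.0 * math.pi * x2)     # non-linear in both scenarios
        if scenario == "strong":
            f3 = 4.0 * (x3 - 0.5) ** 2        # X3 is non-linear as well
        else:
            f3 = beta_x3 * x3                 # X3 enters linearly (part of X*beta)
        u = math.sin(3.0 * sx) * math.cos(3.0 * sy)  # smooth stand-in for u_s
        eps = rng.gauss(0.0, sigma_eps)       # Gaussian observation noise
        y = level_effect[level] + f2 + f3 + u + eps
        rows.append({"sx": sx, "sy": sy, "level": level, "x2": x2, "x3": x3,
                     "f3": f3, "u": u, "y": y})
    return rows

strong = simulate(scenario="strong")
weak = simulate(scenario="weak")
print(len(strong), sorted({r["level"] for r in strong}))
```

The only difference between the two scenarios in this sketch is whether X3 contributes through a non-linear function or through a single linear coefficient, which mirrors the distinction between \(|K|=2\) and \(|K|=1\) in the model.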

For data simulation, we will first define and simulate the components of the model in the following order:

  • defining the study region and the mesh for simulating the spatial effect (SPDE-FEM),
  • simulating the spatial effect and the categorical variable (common to both scenarios), and
  • simulating the non-linear effects for each scenario.

These components constitute the linear predictor of the model.
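As an illustration of the spatial component, the sketch below draws one realization of a zero-mean Gaussian field at random locations. Note that it swaps in a different technique from the one used in this example: instead of the SPDE-FEM approach on a mesh, it builds a dense exponential covariance matrix and uses a Cholesky factor. The function names, the marginal standard deviation `sigma`, and the range-like parameter `rho` are assumptions made only for illustration.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def simulate_field(locs, sigma=1.0, rho=0.3, seed=0):
    """Draw one zero-mean Gaussian field with covariance sigma^2 * exp(-d / rho)."""
    rng = random.Random(seed)
    n = len(locs)
    cov = [[sigma ** 2 * math.exp(-math.dist(locs[i], locs[j]) / rho)
            for j in range(n)] for i in range(n)]
    for i in range(n):
        cov[i][i] += 1e-8  # small jitter for numerical stability
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent standard normals
    # u = L z has covariance L L^T = cov
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

rng = random.Random(3)
locs = [(rng.random(), rng.random()) for _ in range(150)]  # random sample locations
u = simulate_field(locs)
print(min(u), max(u))
```

The dense approach scales cubically with the number of locations, which is one reason a sparse mesh-based representation is attractive for larger problems.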

Defining the study region and the mesh for the spatial simulation

A. Original boundary and the internal boundary used to define the mesh for the SPDE-FEM approach.

B. Mesh for the simulation of the spatial effect using the SPDE-FEM approach.

Simulating the spatial effect (SPDE2) and the categorical variable

A. Spatial effect along the study region.

B. Spatial effect values in the sample locations.

C. Levels of the categorical variable in the sample locations.

Simulation of the response variable under the strong non-linearity case

A. Second variable (X2) under the strong non-linearity.

B. Third variable (X3) under the strong non-linearity.

C. Response variable under strong non-linearity in the sample locations.

Simulation of the response variable under the weak non-linearity case

A. Second variable (X2) under the weak non-linearity.

B. Third variable (X3) under the weak non-linearity.

C. Response variable under weak non-linearity in the sample locations.

Analysis of the data under the strong non-linearity case

The simulated data are analyzed separately for the two scenarios. In each scenario, the procedure is as follows:

  1. Split the data into train and test sets (the test set contains \(20\%\) of the observations).
  2. Perform Bayesian inferential analysis, considering two models: simple and complex.
    1. The simple model assumes that the effects of the covariates are linear.
    2. The complex model considers a non-linear structure for the effects of the non-linear covariates.
  3. Compute the residuals using the mean of the posterior marginal distribution of the expectation for each data point in the training and test sets.
  4. Analyze the point estimates of the residuals using Random Forest (RF). Two different strategies will be followed:
    1. using the values of the covariates and the Cartesian coordinates of the observation locations, or
    2. using the mean values of the marginal posterior distributions of the non-linear effects and the spatial effect (to capture the geometry of the marginal posterior distributions).
  5. Compare the RMSE for the train and test sets based on the results from Bayesian inference (INLA) or from combining Bayesian inference with residual analysis using RF (INLA-RF).

Additionally, it would be possible to use the entire marginal posterior distribution of the expectation for the residuals, instead of the mean of this distribution, as a proxy for the residuals.
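The procedure above can be sketched end to end on toy data. This is only an illustration under stated assumptions, not the implementation behind the reported results: the Bayesian first stage is replaced by a least-squares line (a stand-in for the posterior mean of the simple model), the Random Forest is a small hand-written one (bagged regression trees that search a random subset of features at each split), and the data, function names, and tuning values are hypothetical. The RF uses the first strategy, that is, covariate values and coordinates as features.

```python
import math
import random

def rmse(y, yhat):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def fit_line(x, y):
    """Least-squares line; a stand-in for the posterior mean of the simple model."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return lambda v: my + slope * (v - mx)

def fit_tree(X, r, rng, depth=5, min_leaf=5, mtry=2):
    """Regression tree that searches `mtry` randomly chosen features per split."""
    n, mean = len(r), sum(r) / len(r)
    if depth == 0 or n < 2 * min_leaf:
        return lambda row: mean
    total, best = sum(r), None
    for j in rng.sample(range(len(X[0])), mtry):
        order = sorted(range(n), key=lambda i: X[i][j])
        s_left = sum(r[i] for i in order[:min_leaf])
        for cut in range(min_leaf, n - min_leaf + 1):
            # Maximising this quantity minimises the within-node squared error.
            gain = s_left ** 2 / cut + (total - s_left) ** 2 / (n - cut)
            if best is None or gain > best[0]:
                best = (gain, j, 0.5 * (X[order[cut - 1]][j] + X[order[cut]][j]))
            s_left += r[order[cut]]
    _, j, thr = best
    left = [i for i in range(n) if X[i][j] <= thr]
    right = [i for i in range(n) if X[i][j] > thr]
    if not left or not right:  # ties from bootstrapping can empty one side
        return lambda row: mean
    lt = fit_tree([X[i] for i in left], [r[i] for i in left], rng, depth - 1, min_leaf, mtry)
    rt = fit_tree([X[i] for i in right], [r[i] for i in right], rng, depth - 1, min_leaf, mtry)
    return lambda row: lt(row) if row[j] <= thr else rt(row)

def fit_forest(X, r, n_trees=30, seed=0):
    """Average of trees, each grown on a bootstrap sample of the residuals."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(r)) for _ in r]
        trees.append(fit_tree([X[i] for i in idx], [r[i] for i in idx], rng))
    return lambda row: sum(t(row) for t in trees) / len(trees)

# Toy data (hypothetical): one non-linear covariate plus a smooth spatial surface.
rng = random.Random(42)
n = 400
X = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]  # (x2, sx, sy)
y = [math.sin(2 * math.pi * x2) + 0.5 * math.sin(3 * sx) * math.cos(3 * sy)
     + rng.gauss(0.0, 0.2) for x2, sx, sy in X]

# Step 1: hold out 20% of the observations as the test set.
idx = list(range(n))
rng.shuffle(idx)
test, train = idx[: n // 5], idx[n // 5:]

# Step 2 (stand-in): first-stage fit that treats the covariate effect as linear.
line = fit_line([X[i][0] for i in train], [y[i] for i in train])
base = {i: line(X[i][0]) for i in idx}

# Steps 3-4: residuals on the training set, then RF on covariates and coordinates.
forest = fit_forest([X[i] for i in train], [y[i] - base[i] for i in train])

# Step 5: RMSE of the first stage alone and of the first stage plus RF correction.
results = {}
for name, part in (("train", train), ("test", test)):
    obs = [y[i] for i in part]
    results[name] = (rmse(obs, [base[i] for i in part]),
                     rmse(obs, [base[i] + forest(X[i]) for i in part]))
    print(f"{name}: first stage {results[name][0]:.4f}, with RF {results[name][1]:.4f}")
```

The second strategy would keep the same structure and only change the feature matrix passed to the forest, replacing raw covariates and coordinates with the posterior means of the non-linear and spatial effects.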

Simple INLA model and RF combined analysis (first loop only)

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is not shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      1.2267   1.6937
INLA-RF   0.6678   1.0797
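The KLD panels compare the empirical density of the predictive errors with a Gaussian. Assuming that "closest" means the Gaussian \(q\) minimizing KLD(p||q), a standard result is that this Gaussian matches the mean and variance of \(p\). The sketch below estimates the divergence from a histogram of the errors. The function names, the number of bins, and the toy error samples are assumptions, and this is not necessarily how the figures in this example were computed.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def kld_to_matched_gaussian(errors, bins=30):
    """Histogram estimate of KLD(p||q), with q the moment-matched Gaussian."""
    n = len(errors)
    mu = sum(errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)
    lo, hi = min(errors), max(errors)
    width = (hi - lo) / bins
    counts = [0] * bins
    for e in errors:
        counts[min(int((e - lo) / width), bins - 1)] += 1
    kld = 0.0
    for b, c in enumerate(counts):
        if c == 0:
            continue  # empty bins contribute nothing to KLD(p||q)
        p = c / n  # empirical probability of the bin
        q = normal_cdf(lo + (b + 1) * width, mu, sigma) - normal_cdf(lo + b * width, mu, sigma)
        kld += p * math.log(p / q)
    return mu, sigma, kld

# Toy error samples (hypothetical): one close to Gaussian, one clearly bimodal.
rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(2000)]
bimodal = [rng.gauss(rng.choice((-3.0, 3.0)), 1.0) for _ in range(2000)]
kld_gauss = kld_to_matched_gaussian(gaussian_like)[2]
kld_bimodal = kld_to_matched_gaussian(bimodal)[2]
print(f"KLD, Gaussian-like errors: {kld_gauss:.4f}")
print(f"KLD, bimodal errors: {kld_bimodal:.4f}")
```

A divergence close to zero suggests the errors are well described by a Gaussian, whereas a larger value points to structure that the first-stage model has not captured.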

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      1.2267   1.6937
INLA-RF   0.6252   1.0619

Complex INLA model and RF combined analysis, sharing geometry information (first loop only)

A. Simulated (red) and inferred (black) non-linear effect (X2) in the strong non-linearity scenario. The solid black line shows the mean of the posterior distribution, and the gray shaded area shows the credible interval (q1-q3).

B. Simulated (red) and inferred (black) non-linear effect (X3) in the strong non-linearity scenario. The solid black line shows the mean of the posterior distribution, and the gray shaded area shows the credible interval (q1-q3).

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is not shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.2765   0.5506
INLA-RF   0.2791   0.5227

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.2765   0.5506
INLA-RF   0.2616   0.5554

Analysis of the data under the weak non-linearity case

Simple INLA model and RF combined analysis (first loop only)

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is not shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.1758   0.5631
INLA-RF   0.1571   0.5499

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.1758   0.5631
INLA-RF   0.1472   0.5336

Complex INLA model and RF combined analysis, sharing geometry information (first loop only)

A. Simulated (red) and inferred (black) non-linear effect (X2) in the weak non-linearity scenario. The solid black line shows the mean of the posterior distribution, and the gray shaded area shows the credible interval (q1-q3).

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is not shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.0733   0.4850
INLA-RF   0.0784   0.4829

A. Empirical-real density of the predictive errors and the closest-optimal Gaussian PDF.

B. Derivative of KLD(p||q) and KLD value for the empirical density and the closest-optimal Gaussian.

RMSE for the train and test datasets. The values compare INLA alone with the combination of INLA and RF, where the geometry of the information is shared between INLA and RF (running only the first loop of the algorithm).

          Train    Test
INLA      0.0733   0.4850
INLA-RF   0.0733   0.4878